Unsupervised methods

In this lab, we’ll look at how to implement two unsupervised classification methods:

  • \(k\)-means classification
  • Self-organizing maps

The data we will use is from the Gap Minder project. While you can download the individual variables from the web site, we will use a preprocessed set of the data covering the period 1800-2018, in the file gapminder_1800_2018.csv. Download this to your datafiles folder and make a new folder for today’s class called lab11. You will also need the shapefile of country borders: ne_50m_admin_0_countries.shp (you should have this from a previous lab).

R users

Start RStudio and set the working directory to this directory (This can be changed by going to the [Session] menu and selecting [Set working directory] -> [Choose directory…], then browsing to the folder).

You will need the following packages for today’s lab, so make sure to install anything that is missing before proceeding.

  • dplyr
  • tidyr
  • ggpubr
  • RColorBrewer
  • countrycode
  • sf
  • kohonen
  • tmap

Python users

If you are using Python for today’s lab, you’ll need to download the Jupyter notebook for this lab (GEOG_5160_6160_lab08.ipynb). Make a new folder for the lab (lab08) and move the notebook to this. Now open a new terminal (in Windows go to the [Start Menu] > [Anaconda (64-bit)] > [Anaconda prompt]).

You will need to make sure the following packages are installed on your computer (in addition to the packages we have used in previous labs).

  • xarray: functions for working with regular arrays (conda install xarray)
  • xgboost: extreme gradient boosting (conda install py-xgboost)

Once the terminal is open, change directory to the folder you just made and activate your conda environment:

conda activate geog5160

Start the Jupyter Notebook server:

jupyter notebook

And open the notebook for today’s class.

Unsupervised classification in R

Once you have set everything up, we can start by reading in the data.

gap_df = read.csv("../datafiles/gapminder_1800_2018.csv")
head(gap_df)
##       country year child_mortality fertility per_cap_co2 income life_expectancy
## 1 Afghanistan 1800             469         7          NA    603            28.2
## 2 Afghanistan 1801             469         7          NA    603            28.2
## 3 Afghanistan 1802             469         7          NA    603            28.2
## 4 Afghanistan 1803             469         7          NA    603            28.2
## 5 Afghanistan 1804             469         7          NA    603            28.2
## 6 Afghanistan 1805             469         7          NA    603            28.2
##   population
## 1    3280000
## 2    3280000
## 3    3280000
## 4    3280000
## 5    3280000
## 6    3280000

Now load the following libraries to process and visualize the data:

library(tidyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggpubr)
## Loading required package: ggplot2
library(RColorBrewer)

We can now make some plots to look at the data. First, histograms of the three variables in the data (gdp per capita, population and life expectancy):

names(gap_df)
## [1] "country"         "year"            "child_mortality" "fertility"      
## [5] "per_cap_co2"     "income"          "life_expectancy" "population"
h1 <- gap_df %>%
  gghistogram(x = "child_mortality")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
h2 <- gap_df %>%
  gghistogram(x = "fertility")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
h3 <- gap_df %>%
  gghistogram(x = "per_cap_co2")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
h4 <- gap_df %>%
  gghistogram(x = "income")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
h5 <- gap_df %>%
  gghistogram(x = "life_expectancy")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
h6 <- gap_df %>%
  gghistogram(x = "population")
## Warning: Using `bins = 30` by default. Pick better value with the argument
## `bins`.
ggarrange(h1, h2, h3, h4, h5, h6)
## Warning: Removed 1650 rows containing non-finite values (stat_bin).
## Warning: Removed 3311 rows containing non-finite values (stat_bin).
## Warning: Removed 40656 rows containing non-finite values (stat_bin).
## Warning: Removed 12182 rows containing non-finite values (stat_bin).
## Warning: Removed 3167 rows containing non-finite values (stat_bin).

The histograms clearly show the skew in most of the variables, and we will need to transform the data before analyzing it. We’ll do this in two steps. First, we’ll log transform all variables to remove the skew, then use the scale() function, which in R scales data to \(z\)-scores by subtracting the mean and dividing by the standard deviation. To simplify things, we do all this in a single step by nesting the log() function in the scale() function.

Note that we remove rows with missing observations (we could potentially impute these, but that is a little more difficult for time series data). We also remove any row where the per capita CO2 emissions are zero, partly because zero values would cause problems when log-transforming and partly because they are clearly erroneous.

gap_df_scale <- gap_df %>% 
  drop_na() %>%
  filter(per_cap_co2 > 0) %>%
  mutate(
    child_mortality = scale(log(child_mortality)),
    fertility = scale(log(fertility)),
    per_cap_co2 = scale(log(per_cap_co2)),
    income = scale(log(income)),
    life_expectancy = scale(log(life_expectancy)),
    population = scale(log(population))
  )

Lastly, we’ll extract a single year of data (2018) to use for the first part of the lab:

gap_df_scale2 <- gap_df_scale %>%
  filter(year == 2018)

\(k\)-means cluster analysis

We’ll next use these data with the \(k\)-means cluster algorithm. The default function for this in R is kmeans(), and we need to pass the data we are going to use, and the number of requested clusters to this function. First let’s figure out which columns we need:

names(gap_df_scale2)
## [1] "country"         "year"            "child_mortality" "fertility"      
## [5] "per_cap_co2"     "income"          "life_expectancy" "population"
gap_kmeans = kmeans(gap_df_scale2[,3:8], centers = 6)

The output of this algorithm provides information about the cluster solution. We can see the size of each cluster (the number of observations assigned to each cluster):

gap_kmeans$size
## [1] 37 40 29 28 19 31

And we can see the cluster assignments for each country:

gap_kmeans$cluster
##   [1] 6 1 3 5 4 3 1 3 2 1 4 2 5 4 2 2 4 6 4 1 2 1 5 2 2 6 6 1 6 3 4 6 6 3 3 5 4
##  [38] 6 1 1 6 2 1 2 2 2 4 1 1 5 1 4 6 2 4 6 4 2 3 1 6 1 3 5 2 4 1 6 6 4 6 1 2 2
##  [75] 5 5 3 3 2 2 3 1 3 1 3 5 4 2 1 1 2 1 1 6 2 2 2 6 6 3 4 6 4 1 2 3 4 1 2 4 5
## [112] 6 5 1 6 3 2 1 6 5 5 2 2 2 5 1 1 1 1 5 5 3 2 2 3 3 6 4 4 3 1 2 4 6 2 2 2 4
## [149] 6 3 3 6 3 1 4 4 5 4 2 2 1 1 6 3 4 6 4 2 1 3 2 6 3 2 3 3 1 5 4 3 5 6 6 1

And finally the cluster centers. These are the prototypes: the average value of each variable for all the observations assigned to a given cluster (note these are the log-transformed and scaled data):

gap_kmeans$centers
##   child_mortality fertility per_cap_co2    income life_expectancy population
## 1      -1.1231799 0.9298109   0.4160478  8.920571        4.281636   15.71450
## 2      -2.2168563 0.5480215   2.1312914 10.456225        4.364861   15.17845
## 3      -1.9366297 0.6165060   2.0042540 10.167852        4.356852   17.84749
## 4      -1.1392574 0.9532052   0.4873032  9.055159        4.270567   12.78904
## 5      -0.7965249 1.0045012   0.1795556  8.757789        4.262115   18.22769
## 6      -0.1801359 1.5266644  -1.7653677  7.397779        4.147750   16.56237

Here, we can see that cluster 6, for example, has the lowest life expectancy and income, as well as high child mortality and fertility rates. Note that \(k\)-means starts from random initial centers, so the cluster numbering (and possibly the exact membership) may differ between runs; use set.seed() before calling kmeans() if you want reproducible results.
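We asked for 6 clusters here, but that choice was arbitrary. One common heuristic (not covered in the lab itself) is the “elbow” method: run \(k\)-means for a range of values of \(k\) and plot the total within-cluster sum of squares against \(k\), looking for the bend where adding more clusters stops helping much. A minimal sketch, using simulated data standing in for gap_df_scale2[, 3:8]:

```r
# Sketch of the elbow heuristic (simulated data, not the Gapminder set)
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
k_range <- 1:8
# Total within-cluster sum of squares for each candidate k
# (nstart = 10 restarts each k-means run to avoid poor local optima)
wss <- sapply(k_range, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})
# The curve drops sharply up to the 'true' number of clusters, then flattens
plot(k_range, wss, type = "b",
     xlab = "Number of clusters (k)", ylab = "Total within-cluster SS")
```

For the lab data, you would replace `x` with `gap_df_scale2[, 3:8]`.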

Maps

As these are spatial data, we can now plot the distribution of the clusters. First, we’ll load the world shapefile in the ne_50m_admin_0_countries folder (we used this in a previous lab):

library(sf)
## Linking to GEOS 3.8.1, GDAL 3.1.4, PROJ 6.3.1
borders = st_read("../datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp")
## Reading layer `ne_50m_admin_0_countries' from data source `/Users/u0784726/Dropbox/Data/devtools/geog5160/datafiles/ne_50m_admin_0_countries/ne_50m_admin_0_countries.shp' using driver `ESRI Shapefile'
## Simple feature collection with 241 features and 94 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -89.99893 xmax: 180 ymax: 83.59961
## geographic CRS: WGS 84

Next, we add the cluster assignments to the gap_df_scale2 data frame:

gap_df_scale2$kmeans = as.factor(gap_kmeans$cluster)

Now we need to merge the gap_df_scale2 data frame with the world borders spatial object. To do this, we add a new column to the data frame containing the ISO3 standard country codes, using the countrycode package (you’ll need to install this).

library(countrycode)
gap_df_scale2 <- gap_df_scale2 %>% 
  mutate(ISO3 = countrycode(country, "country.name", "iso3c"))

Now, we use these ISO codes to merge the two datasets. We need to specify the column name in each dataset that contains the label to be used in merging, which we do with by.x and by.y.

cluster_sf = merge(borders, gap_df_scale2, by.x = "ADM0_A3", by.y = "ISO3")
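As a quick aside, here is a toy illustration (hypothetical values, not the lab data) of how merge() matches rows when the key columns have different names in the two data frames:

```r
# Two small data frames with differently named key columns
df_a <- data.frame(ADM0_A3 = c("USA", "FRA", "DEU"),
                   name = c("United States", "France", "Germany"))
df_b <- data.frame(ISO3 = c("FRA", "USA"),
                   value = c(1, 2))
# by.x names the key in the first data frame, by.y in the second;
# by default only rows whose keys appear in both are kept (an inner join)
merged <- merge(df_a, df_b, by.x = "ADM0_A3", by.y = "ISO3")
merged
```

Passing all = TRUE instead keeps unmatched rows (a full outer join), filling the missing values with NA; this is used later in the lab when mapping a single year.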

Finally, we can plot out the clusters using the tmap package, which highlights the position of the poorer cluster (cluster 6 in the run shown here) in Saharan and sub-Saharan Africa:

library(tmap)
tm_shape(cluster_sf) + 
  tm_fill("kmeans", palette = "Set2") +
  tm_legend(legend.position = c("left", "bottom"))

Hierarchical cluster analysis

We’ll briefly demonstrate the use of hierarchical clustering here. This works a little differently in R, as it requires you to first calculate a dissimilarity matrix (the aggregate difference between each pair of observations), and then use this for clustering.

The R function dist() will calculate this, based on simple Euclidean dissimilarity between samples:

gap_d <- dist(gap_df_scale2[,3:8])

We can now use this to carry out hierarchical clustering using the hclust() function:

gap_hclust <- hclust(gap_d)

And you can then plot the resulting dendrogram with:

plot(gap_hclust)

To actually get a cluster assignment for each observation, we need to ‘cut’ the tree so that it results in the desired number of groups, using the well-named cutree() function. To get 6 clusters:

gap_cluster <- cutree(gap_hclust, 6)
head(gap_cluster)
## [1] 1 2 3 1 4 3

To see what this has done, you can overlay the clusters on the dendrogram as follows:

plot(gap_hclust)
rect.hclust(gap_hclust, k = 6)
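As an aside, cutree() can also cut the tree at a given dissimilarity height (the h argument) rather than requesting a fixed number of groups. A small sketch with simulated data (not the lab data):

```r
# Hierarchical clustering on 20 simulated 2-D observations
set.seed(7)
d <- dist(matrix(rnorm(40), ncol = 2))
hc <- hclust(d)
by_k <- cutree(hc, k = 4)  # cut to give exactly 4 groups
by_h <- cutree(hc, h = 2)  # cut at height 2: all merges above that are undone
length(unique(by_k))
```

Cutting by height is useful when you care about how dissimilar observations are allowed to be within a group, rather than about the number of groups.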

The clusters that you obtain from the cutree() function can be used in the same way that we used the \(k\)-means clusters. Here, we’ll append them to the spatial object (cluster_sf) and map them out:

gap_df_scale2$hclust <- as.factor(gap_cluster)
cluster_sf = merge(borders, gap_df_scale2, by.x = "ADM0_A3", by.y = "ISO3")

tm_shape(cluster_sf) + 
  tm_fill("hclust", palette = "Set2") +
  tm_legend(legend.position = c("left", "bottom"))

Self-organizing maps

We’ll now repeat this analysis with a self-organizing map (SOM), but using the full dataset. For this we’ll use the kohonen package, which allows the construction of unsupervised and supervised SOMs.

library(kohonen)

Building the SOM is relatively simple: first we construct the grid of nodes (30x20) using the somgrid() function. Note the topo argument that controls the neighborhood shape (rectangular or hexagonal):

som_grid = somgrid(xdim = 30, ydim = 20, topo = "hexagonal")

You can plot the empty grid to see the layout with plot(som_grid).

Now we’ll train the SOM. The function we use is som(), and we need to specify the data to be used as well as the SOM grid. The function requires the data to be as a matrix rather than an R data frame, so we first convert this, then pass it to the function:

gap_mat = as.matrix(gap_df_scale[,3:8])
gap_som = som(gap_mat, som_grid)

Once finished, we need to see if the algorithm has converged. We can do this by plotting the change in the loss function as a function of the iterations. As a reminder, the loss function here is the mean distance between the node weights and the observations assigned to that node:

plot(gap_som, type = "changes")

We can see here that the loss function has decreased but not yet stabilized, suggesting that the algorithm has not converged. We’ll re-run it for a larger number of iterations by setting rlen to 1000 (this may take a couple of minutes to run):

gap_som = som(gap_mat, som_grid, rlen = 1000)
plot(gap_som, type = "changes")

This plot shows a series of step-like drops in the loss function, finally stabilizing towards the end of the run.

Plots

Next we’ll go through some visualizations of the data. We’ve already seen the changes in the loss function, so next we’ll make a couple of plots to show how the SOM has trained. First, a plot of distances between nodes, which can be useful to identify outliers:

plot(gap_som, type = "dist")

Next, we plot the number of observations per cluster to look for empty nodes.

plot(gap_som, type = "counts")

The map has a couple of empty nodes. Empty nodes can be common when using larger grid sizes, and can be taken to represent a part of the parameter space that is missing in the observations.
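You can also check for empty nodes numerically: the unit.classif element of a trained SOM holds the winning node for each observation, so any node number that never appears is empty. A base-R sketch with a hypothetical assignment vector (in the lab, you would use gap_som$unit.classif and the number of nodes in som_grid):

```r
# Hypothetical winning-node assignments for 12 observations on a 6-node grid
unit_classif <- c(1, 1, 2, 2, 2, 4, 4, 5, 5, 5, 6, 6)
n_nodes <- 6
# Nodes that never 'win' an observation are empty
empty_nodes <- setdiff(seq_len(n_nodes), unique(unit_classif))
empty_nodes  # node 3 is empty
```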

Next, we’ll plot the codebook vectors. These are the weights associated with each feature, and can be used to interpret the organization of the SOM. For each node, there is a radial plot showing the value of each feature associated with it. Note that the type of plot can change here. If you have a large number of features, this will plot as a line plot rather than a radial plot.

plot(gap_som, type = "code")

The map shows a clear gradient from poorer regions (low income, shown in orange) on the left to richer regions on the right, and a gradient from smaller populations at the top to larger populations at the bottom. Note that life expectancy follows the income gradient fairly well, and the poorer regions tend to have higher child mortality and fertility rates.

It is also possible to plot out the individual feature weights as a heatmap to illustrate the gradient using type = "property". The codebook vectors of weights are contained within the gap_som object. The syntax to extract them is a little complicated, and some examples are given below:

  • Population
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"population"],
     main = "Pop")

  • Life expectancy
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"life_expectancy"],
     main = "Life Exp.")

  • Income (per capita GDP)
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"income"],
     main = "Income")

  • Per capita CO2 emissions
plot(gap_som, type = "property", property = gap_som$codes[[1]][,"per_cap_co2"],
     main = "Per capita CO2")

Cluster analysis

As a next step, we’ll cluster the nodes of the map to try and identify homogeneous regions to simplify the interpretation of the SOM. While \(k\)-means is often used here, we’ll use a hierarchical algorithm (for reasons that will become clear shortly) to form 6 clusters. Again, we first calculate the dissimilarity matrix, but this time between the values (codes) of each of the SOM nodes. Then we use the hclust() and cutree() functions to form the clusters:

dist_codes <- dist(gap_som$codes[[1]])
som_cluster <- cutree(hclust(dist_codes), 6)

We can now show this on the SOM grid using the plot() function, setting type to mapping and using the cluster values in som_cluster to set the colors. The cluster boundaries can also be overlaid with add.cluster.boundaries()

# Define a color palette
mypal = brewer.pal(6, "Set2")
plot(gap_som, type="mapping", bgcol = mypal[som_cluster], 
     main = "Clusters", pch = '.') 
add.cluster.boundaries(gap_som, som_cluster)

The main problem with this is that the clusters are not contiguous on the SOM grid. This reflects the non-linear mapping that a SOM carries out (and may also suggest that 6 clusters is too many). We can force the clusters to be contiguous by including the distance between the nodes on the SOM grid as well as the dissimilarity between codebook values. We’ve already calculated the latter above (as dist_codes), so let’s get the distance between the nodes using the handy helper function unit.distances():

dist_grid <- unit.distances(som_grid)

We then multiply these two distance/dissimilarity matrices together:

dist_adj <- as.matrix(dist_codes) * dist_grid

If we now repeat the clustering, the inclusion of the distances between nodes forces the clusters to be contiguous:

clust_adj = hclust(as.dist(dist_adj), 'ward.D2')
som_cluster_adj = cutree(clust_adj, 6)

plot(gap_som, type="mapping", bgcol = mypal[som_cluster_adj], 
     main = "Clusters", pch = '.') 
add.cluster.boundaries(gap_som, som_cluster_adj)

Next, we obtain aggregate values for each cluster. While we can get these values from the gap_som object, we can also work back to get values from the original data (in gap_df). In the following code we:

  • Extract a vector with the node assignments for each country
  • Use this to get the cluster number for the node that the country is assigned to
  • Attach this to the gap_df_scale data frame as a factor (rather than an integer)
nodeByCountry <- gap_som$unit.classif
gap_df_scale$somclust <- as.factor(som_cluster[nodeByCountry])

To get the values for each cluster on the original, non-transformed scale, we need to merge gap_df_scale, which contains the cluster assignments, with the original data (gap_df):

gap_df_clus <- merge(gap_df, gap_df_scale, by = c("country", "year"))

Now we can use this list of clusters by country to aggregate the variables:

cluster.vals <- aggregate(gap_df_clus[,3:8], 
                         by = list(gap_df_clus$somclust), mean)

Which gives the following table:

Group.1  child_mortality.x  fertility.x  per_cap_co2.x   income.x  life_expectancy.x  population.x
      1          153.70910     4.020676      2.6374468   5409.448           56.54863     120089419
      2          186.33331     5.170498      0.9940760   3628.883           52.19162       6800004
      3           33.41449     2.440454      8.1866828  21569.420           72.38448      27421607
      4          304.61776     6.034029      0.0550169   1296.499           39.62450      80358324
      5           44.10711     3.214266     12.5877154  28298.458           69.70386       1001145
      6          338.75986     6.434567      0.0134140   1217.385           35.91200       5735099

This shows the approximate characteristics for the clusters (in the run shown here):

  1. Medium life expectancy, low income, medium child mortality and fertility rates, large population
  2. Low life expectancy and income, high child mortality and fertility rates, small population
  3. High life expectancy and income, low child mortality and fertility rates, medium population
  4. Very low life expectancy and income, very high child mortality and fertility rates, large population
  5. High life expectancy and income, low child mortality and fertility rates, very small population
  6. Very low life expectancy and income, very high child mortality and fertility rates, small population

As we have data over multiple years, we can also use the results of the SOM analysis to track the changes made by an individual country over time. To do this, we need to know which node a country was assigned to at each of its time steps. We:

  • Find all the observation/row numbers belonging to a country (e.g. here for the United States)
  • Find the node assigned to each of those observations
  • Find the node coordinates (on the grid) for each of those nodes
country_id <- which(gap_df_scale$country == "United States")
country_nodes <- gap_som$unit.classif[country_id]
country_crds <- som_grid$pts[country_nodes, ]

Now we plot out the grid again, shaded with the cluster assignments. Then we use R’s lines() function to overlay the country’s trajectory. In the last few lines, we make a vector of years to label the trajectory; as there will be a lot of these, we only keep the labels at 20-year increments, then add these to the plot:

plot(gap_som, type="mapping", bgcol = mypal[som_cluster_adj], 
     main = "Clusters", pch = '', keepMargins = TRUE) 
#add.cluster.boundaries(gap_som, som_cluster_adj)
lines(country_crds, lwd=2)

years <- ifelse(gap_df_scale$year[country_id] %% 20 == 0, as.character(gap_df_scale$year[country_id]), "")
text(jitter(country_crds), labels = years, cex = 0.75)

We can also plot out the spatial distribution of these clusters for any given year in the dataset. First we need to assign the country codes to link to the shapefile:

gap_df_scale <- gap_df_scale %>% 
  mutate(ISO3 = countrycode(country, "country.name", "iso3c"))

Next we’ll extract a single year and merge it to our spatial object (cluster_sf):

myyear <- 1900
tmp.df <- gap_df_scale %>%
  filter(year == myyear)
cluster_sf = merge(borders, tmp.df, by.x = "ADM0_A3", by.y = "ISO3",
                   all = TRUE)

And finally, plot it:

tm_shape(cluster_sf) + 
  tm_fill("somclust", palette = "Set2") +
  tm_layout(main.title = paste("GAPMinder data",  myyear))

Exercise

For the exercise, you will need to use an unsupervised classification method (a self-organizing map) with the California housing dataset. Your answer should use the following variables (you can use code from a previous example to create the average number of rooms per house, and the ratio of bedrooms to rooms):

  • housing_median_age
  • population
  • median_income
  • median_house_value
  • avg_rooms
  • bedroom_ratio

Your answer should consist of the following:

  • Your R code
  • A plot showing that the loss function has stabilized
  • A plot of the counts per nodes
  • A plot of the codebook vectors
  • One or more heatmaps of the individual features on the SOM
  • A set of clusters constructed from the SOM nodes using the method in the lab (and a figure showing these on the SOM grid)
  • A table giving the average values of each feature per cluster

You do not need to provide a map of the SOM clusters, but if you do, the following code will be useful:

  • Run this before selecting the subset of features for use in the classification. This makes a copy of the district coordinates
housing.crds = housing %>%
  select(longitude, latitude)
  • Run this after the cluster analysis to add the clusters and coordinates back to the housing data frame, and convert to a Spatial* object. This assumes that the SOM output is stored in housing.som and the cluster assignment in som.cluster:
library(sp)
housing$cluster = as.factor(som.cluster[housing.som$unit.classif])
housing = cbind(housing, housing.crds)
coordinates(housing) = ~longitude+latitude